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ABSTRACT 



Aims. We discuss an implementation of our 3D radiative transfer (3DRT) framework with the OpenCL paradigm for 

general GPU computing. 

Methods. We implement the kernel for solving the 3DRT problem in Cartesian coordinates with periodic boundary 

conditions in the horizontal {x, y) plane, including the construction of the nearest neighbor A* and the operator 

splitting step. 

Results. We present the results of a small and a large test case and compare the timing of the 3DRT calculations for 

serial CPUs and various CPUs. 

Conclusions. The latest available CPUs can lead to significant speedups for both small and large grids compared to 

serial (single core) computations. 
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1. Introduction 



In a series of p apers Hauschildt &: BaronI (I2006D: 
iBaron fc Hauschildtj ()2007[^ : iHau schildt fc Baron 



200' 



Baron. H auschil dt. fc Chen 
Hauschildt fc BaronI ()2010t) 



(^20081 

(2009); 

IScclmann. Hauschildt. &: BaronI ( 20101 hereafter: Papers 
I-VH), we have described a framework for the solution 
of the radiative transfer equation in 3D systems (3DRT), 
including a detailed treatment of scattering in continua 
and lines with a non-local operator splitting method, and 
its use in the general model atmosphere package PHOENIX. 

The 3DRT framework discussed in the previous pa- 
pers of this scries requires very substantial amounts of 
computing time due to the complexity of the radiative 
transfer problem in strongly scattering environments. The 
standard method to speed up these calculations is to im- 
plement parallel algorithms fo r distributed memory ma- 
chines using the MPI lib rary ([Hauschildt fc BaronI [200l 
iBaron fc Hauschildtj 120071 ). The development of relatively 
inexpensive graphic processing units (CPUs) with large 
numbers of cores and the development of relatively easy to 
use programming models, OpenCL and CUDA has made 
the use of CPUs attractive for the acceleration of sci- 
entific simulations. CPUs are built to handle numerous 
lightweight parallel threads simultaneously and offer theo- 
retical peak performance far beyond that of current CPUs. 
However, using them efficiently requires different program- 
ming methods and algorithms than those employed on stan- 
dard CPUs. We describe our first results of implement- 
ing our 3 DRT fra.mewo rk for a single geometry within the 
OpenCL ()Munshill2009D paradigm for generalized GPU and 
CPU computing. 



2. Method 

In the following discussion wc use notation of Papers I - 
VII. The basic framework and the methods used for the for- 
mal solution and the solution of the scattering problem via 
non-local operator splitting are discussed in detail in these 
papers and will not be repeated here. Our implementation 
of the 3DRT framework considers the most time consuming 
parts of the process — the formal solution and the solution 
of the operator splitting equations — to obtain the updated 
values of the mean intensities, whereas less time consum- 
ing parts of the code (set-up, Ng acceleration, etc) are left 
to Fortran95 or C implementations. The OpenCL imple- 
mentation of the 3DRT framework minimizes data transfer 
between the host computer (CPU) and the GPU and keeps 
most of the data locally on the GPU memory Only the rel- 
evant input data (e.g., opacities) and the results, e.g., the 
mean intensities J for all voxels, need to be transferred to 
and from the GPU device. 



2.1. General purpose computing on graphic processors 

Using a GPU for numerical calculations requires spe- 
cial programming environments. While GPU manufactur- 
ers have provided programming environments for vendor- 
special hardware, e.g., CUDA (NVIDIA 2007! ) for N VIDIA 
produced CPUs and ATI Stream SDK (AMD 200JJ) (which 
has now been replaced by AMD APP (,AMD. .201ll ) which 
uses OpenCL) for AMD/ATI produced CPUs. The differ- 
ences between these environments make programs specific 
to them. Our applications need to be extremely portable 
and thus having to code for multiple vendor-specific envi- 
ronments is not acceptable. Fortunately, the open standard 
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OpenCL (jMunshil 120091 ) was designed to efficiently use not 
only GPUs but also modern multi-core CPUs and other 
accelerators using a thread-centered programming model. 
With OpenCL it is possible to run the same code on many 
different CPU and GPU architectures. There is a rela- 
tively minor cost of some loss of performance when using 
Op enCL as compared to C UDA specific programming mod- 
els (jKomatsu et al.ir201lD . This is insignificant for our ap- 
plication where portability is far more important than the 
fraction of the theoretical peak performance attained for a 
specific piece of hardware. At the present time, OpenCL is 
available for all types of GPUs and CPUs, including acceler- 
ators such as the Cell Broadband Engine (CBE), whereas 
CUDA is only available for NIVIA GPUs. This is a ma- 
jor concern for us as the technology is progressing rapidly 
and new hardware is released frequently. Using a defined 
standard is, therefore, already important to build a reliable 
code base that can easily be used in the future. Maintaining 
several different codes for the same tasks in different pro- 
gramming languages is, on the other hand, is costly in terms 
of human time and also error prone. The disadvantage of 
this use of general standards is a loss of performance. We 
consider this as a low price for the portability as hardware 
features and performance increase dramatically with new 
hardware. Therefore, we implemented our 3DRT framework 
in OpenCL for portability reasons. 

The design of GPUs differs considerably from the design 
of CPUs, focusing much more on simultaneous execution of 
many threads to hide memory access latencies. In contrast 
to CPUs, branching is costly on GPUs and should therefore 
be limited as much as possible. It is in many cases faster 
to compute both branches of a decision and then select the 
correct one afterwards rather than using conditional exe- 
cution. This is not an uncommon strategy and was used, 
for example, on the original CRAY vector machines in the 
1980s. In addition, GPUs provide better performance for 
regular memory access patterns. The preferred program- 
ming model for these GPU systems is a single-program, 
multiple-data (SPMD) type scheme which is directly sup- 
ported by OpenCL. Branching within a program is allowed 
in this model, but often drastically reduces performance 
and thus should be avoided. 

2.2. Implementation of the formal solution and A* 
computation 

As a first step, we have implemented the "simplest" formal 
solution kernel in OpenCL. This is the kernel for Cartesian 
coordinates with periodic bou ndary conditions in the (hor i- 
zontal) x — y plane discussed in lHauschildt fc BaronI (|2008[) . 
An OpenCL implementation of the formal solution is in 
principle straight forward: For any given direction of pho- 
ton propagation, all characteristics can be tracked simulta- 
neously through the voxel grid, which corresponds in the 
OpenCL paradigm to a 2D kernel (the characteristics are 
started on the inner or outer x — y planes). The only sub- 
stantial hurdle in the problem is that OpenCL (version 
1.0 or 1.1) does not have facilities for atomic updates of 
floating point variables. This is, however, necessary for the 
numerically correct operation of a straight-forward imple- 
mentation of the formal solution for the calculation of the 
mean intensities and the A* operator. Therefore, we have 
implemented a 2-pass kernel, where in the first pass the 
intensities (etc) are computed and stored along the charac- 



teristics which can be implemented with atomic operations 
on integer variables when the results are stored per voxel 
rather than per characteristic. In a second pass, these data 
are collected on a per voxel (3D) OpenCL kernel. With 
this method, the results are identical (to the precision used 
in the OpenCL implementation) to the Fortran95 imple- 
mentation. However, the 2-pass method requires additional 
memory on the GPU to store the intermediate results and 
the first pass generates complex memory access patterns 
which are likely to limit performance on GPU based sys- 
tems. 



2.3. Implementation of the operator splitting step 

The second very time-consuming part of the 3DRT frame- 
work is the solution of the operator splitting equations to 
compute the new values of the mean intensities J for all vox- 
els. The Fortran95 code solves these equations by Jordan 
or Gauss-Seidel iterations. In OpenCL, it is much simpler 
to implement the Jordan method as it requires less syn- 
chronization between threads than does the Gauss-Seidel 
method. The OpenCL implementation uses a 3D kernel (all 
voxels simultaneously) and locally buffers the A* during the 
iterations, which dramatically speeds up the calculations. 

3. Results 

3.1. Test case setup 

The test cases we have investigated follow the continuum 
tests used in[Hauschildt & Baron (2008). In detail, we used 
the following configuration for the tests: Periodic bound- 
ary conditions (PBCs) are realized in a plane parallel slab. 
We use PBCs on the x and y axes, Zmax is at the outside 
boundary, Zmin the inside boundary. The slab has a finite 
optical depth in the z axis. The grey continuum opacity 
is parameterized by a power law in the continuum optical 
depth Tstd in the z axis. The basic model parameters are 

1. The total thickness of the slab, z^ax — ^^min = 10^ cm 

2. The minimum optical depth in the continuum, tJ^™ = 
10~^ and the maximum optical depth in the continuum, 

-max 1 n8 

^std — J^U . 

3. An isothermal slab, T = lO'' K 

4. Boundary conditions: Outer boundary condition I^^^ = 
and inner boundary condition LTE diffusion for all 
wavelengths. 

5. Parameterized coherent & isotropic continuum scatter- 
ing given by 

Xc = ecKc + (1 - ec)crc 

with < ec < 1. Kc and ac are the continuum absorption 
and scattering coefficients. 

For the tests presented here, we use Cc = 10^^ in order to 
allow single precision runs, for smaller epsilons double pre- 
cision is required for the solution of the operator splitting 
equations. 

We have verified that the OpenCL calculations give the 
same result as the Fortran95 CPU calculations for the for- 
mal solution (intensities), the mean intensities and the A* 
as well as that the solution of the 3D radiative transfer 
problem is the same for both OpenCL and Fortran95. For 
OpenCL single precision calculations, the relative accuracy 
is about 10~^, which is acceptable for most calculations. 
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3.2. Timing results 

In Figures HHH we show the timing and speed-up resuhs 
for smaU and large test cases. The difference between the 
tests is simply the size of the voxel grid. In the small case, 
we use GS'^ = 274, 625 voxels, whereas the large test case 
uses a grid with 129^ * 193 = 3, 211, 713 voxels. The small 
test cases uses 95.3 MB OpenCL memory in single precision 
(165.5 MB double precision) and thus fits easily in GPU de- 
vices with little memory, whereas the large test case uses 
1.1 GB OpenCL memory in single precision (1.9 GB in dou- 
ble precision) and can thus only be used on high-end GPU 
devices. The tests were run on a variety of systems, from 
laptops with low-end GPUs to Xeon-based systems with 
dedicated GPU based numerical accelerator boards. For the 
comparisons in the figures, we have selected the fastest CPU 
run as the serial baseline for all comparisons. The systems 
used in the tests are: 

1. serial CPU: Mac Pro with OSX 10.6.4 and Intel Fortran 
11.1, the CPU is a Xeon E5520 with 2.27 GHz clock- 
speed 

2. MPI: 4 processes on Mac Pro with OSX 10.6.4 and Intel 
Fortran 11.1, the CPU is a Xeon E5520 (4 cores) with 
2.27 GHz clock-speed 

3. AMD/ATI Radeon HD 4870 GPU with 512MB RAM 
on a Mac Pro with OSX 10.6.4 OpenCL 

4. NVIDIA GeForce 8600M GT GPU with 512MB RAM 
on a MacBook Pro laptop with OSX 10.6.4 OpenCL 

5. NVIDIA GeForce GT 120 GPU with 512MB RAM on 
a Mac Pro with OSX 10.6.4 OpenCL 

6. NVIDIA Quadro FX 4800 GPU with 1536 MB RAM on 
a Mac Pro with OSX 10.6.4 OpenCL 

7. NVIDIA GeForce 8800GT GPU with 512 MB RAM on a 
Linux PC with NVIDIA OpenCL (OpenCL 1.1, CUDA 
3.2.1) 

8. NVIDIA GeForce GTX 285 GPU with 1024MB RAM 
on a Linux PC with NVIDIA OpenCL (OpenCL 1.0, 
CUDA 3.0.1) 

9. NVIDIA Tesla C2050 Fermi-GPU with 2687 MB RAM 
on a Linux PC with NVIDIA OpenCL (OpenCL 1.1, 
CUDA 3.2.1) 

10. Intel Xeon E5520 at 2.27 GHz clock-speed on a Mac Pro 
with OSX 10.6.4, OpenCL (16 OpenCL threads) 

11. Intel Xeon E5520 at 2.27 GHz clock-speed on a Mac 
Pro with OSX 10.6.4, Intel Fortran 11.1 OpenMP (16 
OpenMP threads) 

All these system were used for the small test case, the large 
test case could only be run on the serial CPU, the 2 OpenCL 
CPU runs, and on the Quadro FX 4800 and Tesla C2050 
GPUs due to memory constraints. 

The results for the small test case show that low-end 
GPUs (GeForce GT 120, GeForce 8xxx) do not provide 
significant speed-up compared to serial CPU calculations. 
However, compared to a laptop class CPU they can be use- 
ful (the GeForce 8600M GT reaches about the speed of the 
serial CPU of the laptop, an Intel T9300 CPU at 2.5GHz) 
as they can be used in parallel with the CPU (e.g., to offioad 
formal solutions for visualizations from the CPU). Medium 
grade GPUs (Radeon HD 4870 or Quadro FX 4800) give 
speed-ups of the order of 4-5 compared to CPUs, which is 
already quite useful for small scale calculations on work- 
stations. High-end GPUs (GeForce GTX 285, Tesla C2050) 
deliver substantially larger speed-ups for the small test case. 



a factor of 28 for the Fermi-GPU based accelerator is very 
significant for calculations. 

For the large case, which is close to the size of a real 
production calculation, we show the timing results in Figs. [3] 
and m The memory requirements of the calculations now 
limit the tests to the CPUs (serial and OpenCL) and the 
Quadro FX 4800 and Tesla C2050 devices. For the OpenCL 
runs on CPUs and the Tesla C2050 runs we also include the 
results for double precision OpenCL calculations. In this 
test, the GPUs deliver larger speed-ups, up to a factor of 36 
for the Tesla C2050 device. Using double precision reduces 
the speed-up to about a factor of 13 (a factor of about 
2.7 slower than single precision), which is presumably due 
to more complex memory accesses and less efficient double 
precision hardware on the Fermi GPU. Running OpenCL 
on CPUs is still not efficient compared to running MPI 
code, but the timings are essentially the same regardless 
of single or double precision. OpenCL is about as efficient 
as using OpenMP shared memory parallelization with the 
same number of threads. Therefore, OpenCL can be used as 
a more versatile replacement for OpenMP code. It has been 
suggested that ultimately CPU s may be only a factor of 2- 
3 faster than multi-core CPUs (|Lee et al.ll2010l) . therefore, 
a careful split of the computational work between CPUs 
and GPUs is probably the most efficient way to use these 
systems. 

Comparing our results to others in the literature is 
not easy, since most other physics and astronomy appli- 
cations solve problems that have very different computa- 
tional characteristics with different levels of difficulty in 
their parallelization on SMP machines, e.g. N-body prob- 
lems (jCapuzzo-Dolcetta et aD 120111: iGaburov et"aLl l2010t) 
where speedups can a factor of 100 or more when a clever 
strategy is adopted for the evaluation of the pairwise force 
in the system by direct summation. On the other hand, in 
molecular dynamic s problems speedu ps of 20-60 are con- 
sidered acceptable (jChen et al.ll2009[ ). The extremely non- 
local nature of the radiative transfer problem makes the 
kernels extremely complex. Therefore, in order to retain 
numerical accuracy and portability, some fraction of the 
theoretical speedup must be sacrificed. 

We note that we have chosen OpenCL for its porta- 
bility and we have so far not tried to fully optimize our 
kernels for the specific architecture as new features and 
fundamental changes are introduced very frequently into 
new hardware and better OpenCL compilers will reduce 
th e importance of hand-t uning the OpenCL code reported 
by iKomatsu et al] (|2011[ ). Studies have shown that hand- 
tuned optimizations can lead to OpenCL performance ap- 
proaching that given by using vendor specific software 
() Weber et all 1201 It iKomatsu etaIll201lD . but in this stiU 
early state of general computing on GPUs this will change 
with new versions of the OpenCL framework and better 
OpenCL compilers. One of the main issues concerning fur- 
ther optimization is an inherent problem of the formal so- 
lution: for each solid angle (direction) any characteristic 
passes through a large fraction of the voxel grid, resulting 
in highly complex memory access patterns that also vary 
from one direction to another and from one characteris- 
tic to another. This is a basic feature of radiative trans- 
fer. Using approaches that maximize data locality work on 
CPUs (with a small number of cores) but are very ineffi- 
cient on GPUs as many PEs may be idle at any given time 
(again, depending on the direction). We have done a num- 
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ber of experiments with different approaches and found that 
the parahel tracking implementation described above is a 
good overaU compromise that keeps many OpenCL threads 
active and aUows the GPU hardware to mask many mem- 
ory latencies. On CPUs, the impact is even smaller as the 
number of cores tends to be small and complex memory 
access patterns are handled more efficiently than for a sin- 
gle thread on a GPU. With this the algorithm performs 
best on newer GPU hardware compared to older hardware, 
which is encouraging for future devices. A promising venue 
for further optimization requires the availability of atomic 
updates for floating point numbers in OpenCL, this could 
remove the need for a two pass approach in the formal so- 
lution and may improve performance. 



4. Summary and Conclusions 

We have described first results we have obtained imple- 
menting our 3DRT framework in OpenCL for use on CPUs 
and similar accelerators. The results for 3D radiative trans- 
fer in Cartesian coordinates with periodic boundary condi- 
tions show that high-end CPUs can results in quite large 
speed-ups compared to serial CPUs and are thus useful to 
accelerate complex calculations. This is in particular use- 
ful for clusters where each node has one GPU device and 
where calculations can be domain-decompositioned with a 
one node granularity. Large scale calculations that require 
a domain-decomposition larger than one node are more ef- 
ficient on large scale supercomputers with lOOO's of cores 
as data transfer required for simultaneous use of multiple 
GPUs on multiple nodes will dramatically reduce perfor- 
mance. Even medium-end or low-end GPUs can be useful 
to offioad calculations from the CPU to speed up the overall 
calculations. 
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Fig. 1. Timing of the small 3D radiative transfer test calculation on CPUs (leftmost column), various CPUs with 
OpenCL, and multi-core CPUs using OpenCL. The times are given in seconds of wallclock time. 
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Fig. 2. Speed-ups of the small 3D radiative transfer test calculation for the OpenCL implementation relative to the 
serial CPU run. 
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Fig. 3. Timing of the large 3D radiative transfer test calculation on CPUs (leftmost column), various CPUs with 
OpcnCL, and multi-core CPUs using OpenCL, MPI and OpenMP. The MPI calculation was run on 4 cores (1 CPU), 
the OpenMP run used 16 threads (8 cores, inch hyperthrcading) to be comparable to the OpcnCL CPU run. The times 
are given in seconds of wallclock time. 
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Fig. 4. Speed-ups of the large 3D radiative transfer test calculation for the OpenCL. OpenMP, and MPI implementations 
relative to the serial CPU run. The MPI calculation was run on 4 cores (1 CPU), the OpenMP run used 16 threads (8 
cores, inch hyperthreading) to be comparable to the OpenCL CPU run. 



